Abstract. The verification of multithreaded software is still a challenge.
This comes mainly from the fact that the number of thread interleavings
grows exponentially in the number of threads. The idea that thread
interleavings can be studied with a matrix calculus is a novel approach 
in this research area. Our sparse matrix representations of the program 
are manipulated using a lazy implementation of Kronecker algebra. One 
goal is the generation of a data structure called Concurrent Program 
Graph (CPG) which describes all possible interleavings and incorporates 
synchronization while preserving completeness. We prove that CPGs in 
general can be represented by sparse adjacency matrices. Thus the num- 
ber of entries in the matrices is linear in their number of lines. Hence 
efficient algorithms can be applied to CPGs. In addition, due to syn- 
chronization only very small parts of the resulting matrix are actually 
needed, whereas the rest is unreachable in terms of automata. Thanks to 
the lazy implementation of the matrix operations the unreachable parts 
are never calculated. This speeds up processing significantly and shows 
that this approach is very promising. 

Various applications including data flow analysis can be performed on
CPGs. Furthermore, the structure of the matrices can be used to prove 
properties of the underlying program for an arbitrary number of threads. 
For example, deadlock freedom is proved for a large class of programs. 

1 Introduction 

With the advent of multi-core processors scientific and industrial interest focuses 
on the verification of multithreaded applications. The scientific challenge comes 
from the fact that the number of thread interleavings grows exponentially in 
a program's number of threads. All state-of-the-art methods, such as model 
checking, suffer from this so-called state explosion problem. The idea that thread 
interleavings can be studied with a matrix calculus is new in this research area. 
We are immediately able to support conditionals, loops, and synchronization. 
Our sparse matrix representations of the program are manipulated using a lazy 
implementation of Kronecker algebra. Similar to [3] we describe synchronization 
by Kronecker products and thread interleavings by Kronecker sums. One goal 
is the generation of a data structure called Concurrent Program Graph (CPG) 
which describes all possible interleavings and incorporates synchronization while
preserving completeness. Similar to CFGs for sequential programs, CPGs may
serve as an analogous graph for concurrent systems. We prove that CPGs in 
general can be represented by sparse adjacency matrices. Thus the number of 
entries in the matrices is linear in their number of lines. 

In the worst-case the number of lines increases exponentially in the number 
of threads. Especially for concurrent programs containing synchronization this 
is very pessimistic. For this case we show that the matrix contains nodes and 
edges unreachable from the entry node. 

We propose two major optimizations. First, if the program contains a lot of 
synchronization, only a very small part of the CPG is reachable. Our lazy imple- 
mentation of the matrix operations computes only this part (cf. Subsect. 3.6). 
Second, if the program has only little synchronization, many edges not accessing 
shared variables will be present, which are reduced during the output process of 
the CPG (cf. Subsect. 3.7). Both optimizations speed up processing significantly 
and show that this approach is very promising. 

We establish a framework for analyses of multithreaded shared memory con- 
current systems which forms a basis for analyses of various properties. Differ- 
ent techniques including dataflow analysis (e.g. [23-25, 14]) and model checking 
(e.g. [6, 9] to name only a few) can be applied to the generated Concurrent Pro- 
gram Graphs (CPGs) defined in Section 3. Furthermore, the structure of the 
matrices can be used to prove properties of the underlying program for an arbi- 
trary number of threads. For example in this paper, deadlock freedom is proved 
for p-v-symmetric programs. 

Theoretical results such as [21] state that synchronization-sensitive and con- 
text-sensitive analysis is impossible even for the simplest analysis problems. Our 
system model differs in that it supports subprograms only via inlining and re- 
cursions are impossible. 

The outline of our paper is as follows. In Section 2 control flow graphs, edge 
splitting, and Kronecker algebra are introduced. Our model of concurrency, its 
properties, and important optimizations like our lazy approach are presented in 
Section 3. In Section 4 we give a client-server example with 32 clients showing 
the efficiency of our approach. For a matrix with a potential order of 10^^ our
lazy approach delivers the result in 0.43s. Section 5 demonstrates how dead- 
lock freedom is proved for p-v-symmetric programs with an arbitrary number of 
threads. An example for detecting a data race is given in Section 6. Section 7 is 
devoted to an empirical analysis. In Section 8 we survey related work. Finally, 
we draw our conclusion in Section 9. 

2 Preliminaries 
2.1 Overview 

We model shared memory concurrent systems by threads which use semaphores
for synchronization. Threads and semaphores are represented by control flow
graphs (CFGs). Edge splitting has to be applied to the edges of thread CFGs
that access more than one shared variable. Edge splitting is straightforward
and is described in Subsect. 2.3. The resulting Refined CFGs (RCFGs) are
represented by adjacency matrices. These matrices are then manipulated by
Kronecker algebra. We assume that the edges of CFGs are labeled by elements of a
semiring. Details follow in this subsection. Similar definitions and further
properties can be found in [16].

A semiring $(\mathcal{L}, +, \cdot, 0, 1)$ consists of a set of labels $\mathcal{L}$, two
binary operations $+$ and $\cdot$, and two constants 0 and 1 such that

1. $(\mathcal{L}, +, 0)$ is a commutative monoid,

2. $(\mathcal{L}, \cdot, 1)$ is a monoid,

3. $\forall l_1, l_2, l_3 \in \mathcal{L}: l_1 \cdot (l_2 + l_3) = l_1 \cdot l_2 + l_1 \cdot l_3$ and
   $(l_1 + l_2) \cdot l_3 = l_1 \cdot l_3 + l_2 \cdot l_3$ hold, and

4. $\forall l \in \mathcal{L}: 0 \cdot l = l \cdot 0 = 0$.

Intuitively, our semiring is a unital ring without subtraction. For each $l \in \mathcal{L}$
the usual rules are valid, e.g., $l + 0 = 0 + l = l$ and $1 \cdot l = l \cdot 1 = l$. In addition
we equip our semiring with the unary operation $*$. For each $l \in \mathcal{L}$, $l^*$ is defined
by $l^* = \sum_{j \geq 0} l^j$, where $l^0 = 1$ and $l^{j+1} = l^j \cdot l = l \cdot l^j$ for $j \geq 0$.

Our set of labels $\mathcal{L}$ is defined by $\mathcal{L} = \mathcal{L}_V \cup \mathcal{L}_S$, where $\mathcal{L}_V$ is the set of
non-synchronization labels and $\mathcal{L}_S$ is the set of labels representing semaphore
calls. The sets $\mathcal{L}_V$ and $\mathcal{L}_S$ are disjoint. The set $\mathcal{L}_S$ itself consists of two
disjoint sets $\mathcal{L}_{S_p}$ and $\mathcal{L}_{S_v}$. The first denotes the set of labels referring to
P-calls, whereas the latter refers to V-calls of semaphores.

Examples for semirings include regular expressions (cf. [26]) which can be 
used for performing dataflow analysis. 

2.2 Control Flow Graphs 

A Control Flow Graph (CFG) is a directed labeled graph defined by
$G = (V, E, n_e)$ with a set of nodes $V$, a set of directed edges $E \subseteq V \times V$, and a
so-called entry node $n_e \in V$. We require that each $n \in V$ is reachable through
a sequence of edges from $n_e$. Nodes can have at most two outgoing edges. Thus
the maximum number of edges in CFGs is $2\,|V|$. We will use this property later.

Usually CFG nodes represent basic blocks (cf. [1]). Because our matrix calculus
manipulates the edges we need to have basic blocks on the edges.^1 Each edge
$e \in E$ is assigned a basic block $b$. In this paper we refer to them as edge labels
as defined in the previous subsection. To keep things simple we use edges, their
labels and the corresponding entries of the adjacency matrices synonymously.

In order to model synchronization we use semaphores. The corresponding
edges typically have labels like $p_1$ and $v_1$, where $p_x, v_x \in \mathcal{L}_S$. Usually two
or more distinct thread CFGs refer to the same semaphore to perform
synchronization. The other labels are elements from $\mathcal{L}_V$. The operations on the basic
blocks are $\cdot$, $+$, and $*$ from the semiring defined above (cf. [26]). Intuitively,
$\cdot$, $+$, and $*$ model consecutive basic blocks, conditionals, and loops, respectively.

^1 We chose the incoming edges.
Fig. 1. Semaphores: (a) Binary Semaphore, (b) Counting Semaphore

In Fig. 1(a) and 1(b) a binary and a counting semaphore are depicted. The
latter allows two threads to enter at the same time. In a similar way it is possible
to construct semaphores allowing n non-blocking P-calls.



2.3 Edge Splitting 

A basic block consists of multiple consecutive statements without jumps. For
our purpose we need a finer granularity than we would have with basic blocks
alone. To achieve the required granularity we need to split edges. Shared variable
accesses and semaphore calls may occur in basic blocks. For both it is necessary
to split edges. This ensures a representation of possible context switches in a
manner exact enough for our purposes. We say "exact enough" because by using
basic blocks together with the above refinement, we already have coarsened the
analysis compared to the possibilities on statement-level. Furthermore we do not
lose any information required for the completeness of our approach. Applying
this procedure to a CFG, i.e. splitting edges in a CFG, results in a
Refined Control Flow Graph (RCFG).

Let $\mathcal{V}$ be the set of shared variables. In addition, let a shared variable $v \in \mathcal{V}$ be
a volatile variable located in the shared memory which is accessed by two or more
threads. Splitting an edge depends on the number of shared variables accessed
in the corresponding basic block. For edge $e$ this number is referred to as
NSV($e$). In the same way we refer to NSV($b$) as the number of shared variables
accessed in basic block $b$. If NSV($e$) > 1, edge splitting has to be applied to edge
$e$; the edge is used unchanged otherwise.

If edge splitting has to be applied to edge $e$ which has basic block $b$ assigned
and NSV($b$) = $k$, then the basic blocks $b_1, \ldots, b_k$ represent the subsequent parts
of $b$ in such a way that $\forall b_i$: NSV($b_i$) = 1, where $1 \leq i \leq k$. Edges $e_j$ get assigned
basic block $b_j$, where $1 \leq j \leq k$. In Fig. 2 the splitting of an edge with basic
block $b$ and NSV($b$) = $k$ is depicted.
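The splitting rule above can be sketched in a few lines. This is only an illustration under assumptions that are not part of the paper: a basic block is modelled as a list of `(statement, shared_variable_or_None)` pairs, every statement accesses at most one shared variable, and NSV is counted per access. The names `split_block` and `nsv` are hypothetical.

```python
def nsv(part):
    """Number of shared-variable accesses in a (part of a) basic block."""
    return sum(1 for _, var in part if var is not None)

def split_block(statements):
    """Split a basic block into consecutive parts with one access each.

    statements: list of (text, shared_var_or_None) pairs (assumed format).
    A new part starts whenever a second shared access would enter the
    current part; statements without shared accesses stay where they are.
    """
    parts, current, seen_access = [], [], False
    for text, var in statements:
        if var is not None and seen_access:
            parts.append(current)
            current, seen_access = [], False
        current.append((text, var))
        if var is not None:
            seen_access = True
    if current:
        parts.append(current)
    return parts

# Basic block b with NSV(b) = 3 (x, y and z are assumed to be shared).
block = [("t = x", "x"), ("t = t + 1", None), ("y = t", "y"), ("z = 0", "z")]
parts = split_block(block)
```

Each resulting part would label one of the edges $e_1, \ldots, e_k$ of the RCFG; a block with at most one access comes back as a single part, i.e. the edge is used unchanged.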

Fig. 2. Edge Splitting for Shared Variable Accesses

For semaphore calls (e.g. $p_1$ and $v_1$) edge splitting is required in a similar
fashion. In contrast to shared variable accesses we require that semaphore calls
have to be the only statement on the corresponding edge. The remaining
consecutive parts of the basic block are situated on the previous and succeeding edges,
respectively.^2

The effects of edge splitting for shared variables and semaphore calls can be
seen in the data race example given in Section 6. Each RCFG depicted in Fig. 12
is constructed out of one basic block (cf. Fig. 11).

Note that edge splitting ensures that we can model the minimal required 
context switches. The semantics of a concurrent programming language usually
allows more. For example, consider an edge in an RCFG containing two consecutive
statements, where both do not access shared variables. A context switch
may happen in between. However, this additional interleaving does not provide 
new information. Hence our approach provides the minimal number of context 
switches. 

Without loss of generality we assume that the statements in each basic block 
are atomic. Thus, we assume while executing a statement, context switching is 
impossible. In RCFGs the finest possible granularity is at statement-level. If,
according to the program's semantics, atomic statements may access two or more
shared variables, then we make an exception to the above rule and allow two
or more shared variable accesses on a single edge. Such edges have at most one
atomic statement in their basic block. The Kronecker sum (which is introduced
in the next subsection) ensures that all interleavings are generated correctly.

2.4 Synchronization and Generating Interleavings with Kronecker 
Algebra 

Kronecker product and Kronecker sum form Kronecker algebra. In the following
we define both operations, state properties, and give examples. In addition, for
the Kronecker sum we prove a property which we call Mixed Sum Rule.

We define the set of matrices $\mathcal{M} = \{M = (m_{i,j}) \mid m_{i,j} \in \mathcal{L}\}$. In the remaining
parts of this paper only matrices $M \in \mathcal{M}$ will be used, except where stated
explicitly. Let $o(M)$ refer to the order^3 of matrix $M \in \mathcal{M}$. In addition we will
use n-by-n zero matrices $Z_n = (z_{i,j})$, where $\forall i, j: z_{i,j} = 0$.

^2 Note that edges representing a call to a semaphore are not considered to access
shared variables.
^3 A k-by-k matrix is known as square matrix of order k.
Definition 1 (Kronecker product). Given an m-by-n matrix A and a p-by-q
matrix B, their Kronecker product denoted by $A \otimes B$ is an mp-by-nq block matrix
defined by

$$A \otimes B = \begin{pmatrix} a_{1,1}B & \cdots & a_{1,n}B \\ \vdots & \ddots & \vdots \\ a_{m,1}B & \cdots & a_{m,n}B \end{pmatrix}.$$


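Example 1. Let
$A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}$ and
$B = \begin{pmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{pmatrix}$.
The Kronecker product $C = A \otimes B$ is given by

$$\begin{pmatrix}
a_{1,1}b_{1,1} & a_{1,1}b_{1,2} & a_{1,1}b_{1,3} & a_{1,2}b_{1,1} & a_{1,2}b_{1,2} & a_{1,2}b_{1,3} \\
a_{1,1}b_{2,1} & a_{1,1}b_{2,2} & a_{1,1}b_{2,3} & a_{1,2}b_{2,1} & a_{1,2}b_{2,2} & a_{1,2}b_{2,3} \\
a_{1,1}b_{3,1} & a_{1,1}b_{3,2} & a_{1,1}b_{3,3} & a_{1,2}b_{3,1} & a_{1,2}b_{3,2} & a_{1,2}b_{3,3} \\
a_{2,1}b_{1,1} & a_{2,1}b_{1,2} & a_{2,1}b_{1,3} & a_{2,2}b_{1,1} & a_{2,2}b_{1,2} & a_{2,2}b_{1,3} \\
a_{2,1}b_{2,1} & a_{2,1}b_{2,2} & a_{2,1}b_{2,3} & a_{2,2}b_{2,1} & a_{2,2}b_{2,2} & a_{2,2}b_{2,3} \\
a_{2,1}b_{3,1} & a_{2,1}b_{3,2} & a_{2,1}b_{3,3} & a_{2,2}b_{3,1} & a_{2,2}b_{3,2} & a_{2,2}b_{3,3}
\end{pmatrix}.$$
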

As stated in [18] the Kronecker product is also referred to as Zehfuss
product or direct product of matrices or matrix direct product.^4

In the following we list some basic properties of the Kronecker product. Proofs
and additional properties can be found in [2, 10, 7, 11]. Let A, B, C, and D be
matrices. The Kronecker product is noncommutative because in general
$A \otimes B \neq B \otimes A$. It is permutation equivalent because there exist permutation
matrices P and Q such that $A \otimes B = P (B \otimes A) Q$. If A and B are square matrices, then
$A \otimes B$ and $B \otimes A$ are even permutation similar, i.e., $P = Q^T$. The product is
associative as

$$A \otimes (B \otimes C) = (A \otimes B) \otimes C. \qquad (1)$$

In addition, the Kronecker product distributes over +, i.e.,

$$A \otimes (B + C) = A \otimes B + A \otimes C, \qquad (2)$$
$$(A + B) \otimes C = A \otimes C + B \otimes C. \qquad (3)$$

Hence for example $(A + B) \otimes (C + D) = A \otimes C + B \otimes C + A \otimes D + B \otimes D$.
The Kronecker product allows us to model synchronization (cf. Subsect. 3.2).
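Definition 1 and the properties listed above can be tried out numerically. The following is a minimal sketch, assuming dense matrices stored as lists of rows and using integers as a stand-in for the label semiring; the helper names `kron` and `add` are illustrative only.

```python
def kron(A, B):
    """Kronecker product of dense matrices given as lists of rows.

    Row (i, p) of the result is the concatenation of a[i][j] * B[p]
    over all columns j, which is exactly the block layout of Def. 1.
    """
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def add(A, B):
    """Entry-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# Integers replace semiring labels here (an assumption for illustration).
A = [[1, 2], [3, 4]]
B = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
C = [[2, 0, 1], [0, 3, 0], [1, 1, 0]]

AB = kron(A, B)                                   # 6-by-6 block matrix
associative = kron(A, kron(B, C)) == kron(kron(A, B), C)            # Eq. (1)
distributive = kron(A, add(B, C)) == add(kron(A, B), kron(A, C))    # Eq. (2)
commutes = kron(A, B) == kron(B, A)               # False in general
```

For these operands Eqs. (1) and (2) hold, while `kron(A, B)` and `kron(B, A)` differ entry-wise.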

Definition 2 (Kronecker sum). Given a matrix A of order m and a matrix
B of order n, their Kronecker sum denoted by $A \oplus B$ is a matrix of order mn
defined by

$$A \oplus B = A \otimes I_n + I_m \otimes B,$$

where $I_m$ and $I_n$ denote identity matrices^5 of order m and n, respectively.

^4 Knuth notes in [15] that Kronecker never published anything about it. Zehfuss was
actually the first to publish it in the 19th century [27]. He proved that
$\det(A \otimes B) = \det^n(A) \cdot \det^m(B)$, if A and B are matrices of order m and n,
respectively, with entries from the domain of real numbers.

This operation must not be confused with the direct sum of matrices, group
direct product or direct product of modules for which the symbol $\oplus$ is used too.

By calculating the Kronecker sum of the adjacency matrices of two graphs the
adjacency matrix of the Cartesian product graph [12] is computed (cf. [15]).

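Example 2. We use matrices A and B from Ex. 1. The Kronecker sum $A \oplus B$ is
given by

$$A \oplus B = A \otimes I_3 + I_2 \otimes B = \begin{pmatrix}
a_{1,1} + b_{1,1} & b_{1,2} & b_{1,3} & a_{1,2} & 0 & 0 \\
b_{2,1} & a_{1,1} + b_{2,2} & b_{2,3} & 0 & a_{1,2} & 0 \\
b_{3,1} & b_{3,2} & a_{1,1} + b_{3,3} & 0 & 0 & a_{1,2} \\
a_{2,1} & 0 & 0 & a_{2,2} + b_{1,1} & b_{1,2} & b_{1,3} \\
0 & a_{2,1} & 0 & b_{2,1} & a_{2,2} + b_{2,2} & b_{2,3} \\
0 & 0 & a_{2,1} & b_{3,1} & b_{3,2} & a_{2,2} + b_{3,3}
\end{pmatrix}.$$
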

In the following we list basic properties of the Kronecker sum of matrices A,
B, and C. Additional properties can be found in [20] or are proved in this paper.
The Kronecker sum is noncommutative because for element-wise comparison
in general $A \oplus B \neq B \oplus A$. It essentially commutes, however, because from a
graph point of view, the graphs represented by matrices $A \oplus B$ and $B \oplus A$ are
structurally isomorphic.

Now we state a property of the Kronecker sum which we call Mixed Sum 
Rule. 

Lemma 1. Let the matrices A and C have order m and B and D have order
n. Then we call

$$(A \oplus B) + (C \oplus D) = (A + C) \oplus (B + D)$$

the Mixed Sum Rule.

Proof. By using Eqs. (2) and (3) and Def. 2 we get
$(A \oplus B) + (C \oplus D) = A \otimes I_n + I_m \otimes B + C \otimes I_n + I_m \otimes D
= (A + C) \otimes I_n + I_m \otimes (B + D) = (A + C) \oplus (B + D)$.

□
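As a quick sanity check of Def. 2 and the Mixed Sum Rule, the sketch below evaluates both sides of Lemma 1 for small integer matrices. Integers again stand in for semiring labels, and all helper names are illustrative assumptions rather than part of the paper.

```python
def kron(A, B):
    """Kronecker product of dense matrices (lists of rows)."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def add(A, B):
    """Entry-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def identity(n):
    """Identity matrix of order n."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def kron_sum(A, B):
    """Kronecker sum of square matrices: A (+) B = A (x) I_n + I_m (x) B."""
    return add(kron(A, identity(len(B))), kron(identity(len(A)), B))

A = [[1, 2], [3, 4]]                      # order m = 2
C = [[0, 5], [6, 0]]                      # order m = 2
B = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]     # order n = 3
D = [[0, 1, 0], [2, 0, 3], [0, 4, 0]]     # order n = 3

lhs = add(kron_sum(A, B), kron_sum(C, D))
rhs = kron_sum(add(A, C), add(B, D))
```

Both sides agree, and `kron_sum(A, B)` has order $mn = 6$ with the entry $a_{2,2} + b_{3,3}$ in its bottom-right corner, as in Example 2.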

^5 The identity matrix $I_n$ is an n-by-n matrix with ones on the main diagonal and zeros
elsewhere.

For example let the matrices A and B be written as $A = \sum_{i \in I} A_i$ and
$B = \sum_{j \in J} B_j$, respectively. In addition, let the sets I and J have the same number
of elements, i.e., $|I| = |J|$. By using the Mixed Sum Rule we can write
$A \oplus B = \sum_{k=1}^{|I|} (A_{i_k} \oplus B_{j_k})$, where $i_1, \ldots, i_{|I|}$ and $j_1, \ldots, j_{|J|}$ enumerate
the elements of I and J, respectively.
We will frequently use the Mixed Sum Rule from now on without further 
notice. 

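The Kronecker sum is also associative, as $(A \oplus B) \oplus C$ and $A \oplus (B \oplus C)$ are
equal.

Lemma 2. Kronecker sum is associative.

Proof. In the following we will use $I_m \otimes I_n = I_{m \cdot n}$. Note that Z denotes zero
matrices. We have

$$\begin{aligned}
A \oplus (B \oplus C) &= A \oplus (B \otimes I_{o(C)} + I_{o(B)} \otimes C) \\
\{\text{adding } Z_{o(A)}\} &= (A + Z_{o(A)}) \oplus (B \otimes I_{o(C)} + I_{o(B)} \otimes C) \\
\{\text{Lemma 1}\} &= (A \oplus (B \otimes I_{o(C)})) + (Z_{o(A)} \oplus (I_{o(B)} \otimes C)) \\
\{\text{Eq. (1), Def. 2}\} &= (A \oplus (B \otimes I_{o(C)})) + I_{o(A)} \otimes I_{o(B)} \otimes C \\
\{\text{ass. } +, \text{Def. 2}\} &= A \otimes I_{o(B) \cdot o(C)} + I_{o(A)} \otimes B \otimes I_{o(C)} + I_{o(A) \cdot o(B)} \otimes C \\
\{\text{comm. of } +\} &= A \otimes I_{o(B)} \otimes I_{o(C)} + I_{o(A) \cdot o(B)} \otimes C + I_{o(A)} \otimes B \otimes I_{o(C)} \\
\{\text{Def. 2}\} &= ((A \otimes I_{o(B)}) \oplus C) + I_{o(A)} \otimes B \otimes I_{o(C)} \\
\{\text{Def. 2}\} &= ((A \otimes I_{o(B)}) \oplus C) + ((I_{o(A)} \otimes B) \oplus Z_{o(C)}) \\
\{\text{Lemma 1}\} &= (A \otimes I_{o(B)} + I_{o(A)} \otimes B) \oplus (C + Z_{o(C)}) \\
\{\text{rm. } Z_{o(C)}\} &= (A \otimes I_{o(B)} + I_{o(A)} \otimes B) \oplus C \\
\{\text{Def. 2}\} &= (A \oplus B) \oplus C.
\end{aligned}$$

□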

The associativity properties of the operations $\otimes$ and $\oplus$ imply that the k-fold
operations

$$\bigotimes_{i=1}^{k} A_i \quad\text{and}\quad \bigoplus_{i=1}^{k} A_i$$

are well defined.

Note that the Kronecker sum calculates all possible interleavings (see e.g. [17]
for a proof). This is true even for general CFGs including conditionals
and loops. The following example illustrates interleaving of threads and how
the Kronecker sum handles it.

Example 3. Let the matrices C and D be defined as follows:

$$C = \begin{pmatrix} 0 & a & 0 \\ 0 & 0 & b \\ 0 & 0 & 0 \end{pmatrix}, \qquad
D = \begin{pmatrix} 0 & c & 0 \\ 0 & 0 & d \\ 0 & 0 & 0 \end{pmatrix}.$$

Fig. 3. A Simple Example: (a) C, (b) D, (c) Interleavings, (d) $C \oplus D$.
Interleavings listed in Fig. 3(c): a·b·c·d, a·c·b·d, a·c·d·b, c·a·b·d, c·a·d·b, c·d·a·b.

The graph corresponding to matrix C is depicted in Fig. 3(a), whereas the graph
of matrix D is shown in Fig. 3(b). The regular expressions associated to the
CFGs are a·b and c·d, respectively. All possible interleavings by executing C
and D in an interleaving semantics are shown in Fig. 3(c). In Fig. 3(d) the
graph represented by the adjacency matrix $C \oplus D$ is depicted. It is easy to see
that all possible interleavings are generated correctly.
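Example 3 can be reproduced mechanically. The sketch below is an illustration under stated assumptions: adjacency matrices are stored sparsely as dictionaries `{(row, col): label}`, nodes are numbered from 0, and the operands have no self-loops, so the two summands of the Kronecker sum never overlap and the semiring `+` is not needed. The names `kron_sum` and `paths` are hypothetical.

```python
def kron_sum(A, m, B, n):
    """Sparse Kronecker sum A (+) B = A (x) I_n + I_m (x) B.

    A, B: dicts {(row, col): label}; m, n: their orders.
    """
    R = {}
    for (i, j), label in A.items():          # entries of A (x) I_n
        for r in range(n):
            R[(i * n + r, j * n + r)] = label
    for (p, q), label in B.items():          # entries of I_m (x) B
        for r in range(m):
            key = (r * n + p, r * n + q)
            assert key not in R, "overlapping entries would need semiring +"
            R[key] = label
    return R

def paths(M, node, final):
    """Yield the label sequences of all paths node -> final (acyclic graphs)."""
    if node == final:
        yield []
    for (i, j), label in M.items():
        if i == node:
            for rest in paths(M, j, final):
                yield [label] + rest

C = {(0, 1): "a", (1, 2): "b"}
D = {(0, 1): "c", (1, 2): "d"}
CD = kron_sum(C, 3, D, 3)                    # adjacency matrix of order 9
interleavings = sorted("".join(p) for p in paths(CD, 0, 8))
```

`CD` has 12 non-zero entries, and the paths from the entry node 0 to node 8 spell exactly the six interleavings of a·b and c·d.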



3 Concurrent Program Graphs 

Our system model consists of a finite number of threads and a finite number
of semaphores. Both threads and semaphores are represented by CFGs. The
CFGs are stored in the form of adjacency matrices. The matrices have entries which
are referred to as labels $l \in \mathcal{L}$ as defined in Subsect. 2.1. Let $\mathcal{S}$ and $\mathcal{T}$ be the sets
of adjacency matrices representing semaphores and threads, respectively. The
matrices are manipulated by using Kronecker algebra. Similar to [3] we describe
synchronization by Kronecker products and thread interleavings by Kronecker
sums. Note that higher synchronization features of programming languages such
as Ada's rendezvous can be simulated by our system model as the runtime system
uses semaphores provided by the operating systems to implement them.
Formally, the system model consists of the tuple $(\mathcal{T}, \mathcal{S}, \mathcal{L})$, where

- $\mathcal{T}$ is the set of RCFG adjacency matrices describing threads,

- $\mathcal{S}$ is the set of CFG adjacency matrices describing semaphores, and

- $\mathcal{L}$ is the set of labels out of the semiring defined in Subsect. 2.1. The labels
  in $T \in \mathcal{T}$ are elements of $\mathcal{L}$, whereas the labels in $S \in \mathcal{S}$ are elements of $\mathcal{L}_S$.

A Concurrent Program Graph (CPG) is a graph $C = (V, E, n_e)$ with a set
of nodes $V$, a set of directed edges $E \subseteq V \times V$, and a so-called entry node
$n_e \in V$. The sets $V$ and $E$ are constructed out of the elements of $(\mathcal{T}, \mathcal{S}, \mathcal{L})$.
Details on how we generate the sets $V$ and $E$ follow in the next subsections.
Similar to RCFGs the edges of CPGs are labeled by $l \in \mathcal{L}$. Assuming without
loss of generality that each thread has an entry node with index 1 in its adjacency
matrix $T \in \mathcal{T}$, then the entry node of the generated CPG has index 1, too.
Fig. 4. Overview: threads $t_1, \ldots, t_k$ and the shared variables $\mathcal{V}$ yield CFGs, which
edge splitting turns into RCFGs $T^{(i)} \in \mathcal{T}$; semaphores $s_1, \ldots, s_r$ yield CFGs
$S^{(i)} \in \mathcal{S}$; both are combined by the ∘-operation.

In Fig. 4 an overview of our approach is given. As described in Subsect. 2.3
the set of shared variables $\mathcal{V}$ is used to generate $\mathcal{T}$.



3.1 Generating a Concurrent Program's Matrix 

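Let $T^{(i)} \in \mathcal{T}$ and $S^{(i)} \in \mathcal{S}$ refer to the matrices representing thread i and
semaphore i, respectively. Let $M = (m_{i,j}) \in \mathcal{M}$. In addition, we define the
matrix $M_l$ as the matrix with entries of M equal to $l$ and zeros elsewhere:

$$M_l = (m_{l;i,j}), \quad\text{where } m_{l;i,j} = \begin{cases} l & \text{if } m_{i,j} = l, \\ 0 & \text{otherwise.} \end{cases}$$

We obtain the matrix representing the k interleaved threads as

$$T = \bigoplus_{i=1}^{k} T^{(i)}, \quad\text{where } T^{(i)} \in \mathcal{T}.$$

According to Fig. 1 we have for the binary and the counting semaphore an
adjacency matrix of order two and three, respectively. If we assume that the
ith and the jth semaphore, where $1 \leq i, j \leq r$, are a binary and a counting
semaphore, respectively, then we get the following adjacency matrices.

$$S^{(i)} = \begin{pmatrix} 0 & p_i \\ v_i & 0 \end{pmatrix} \quad\text{and}\quad
S^{(j)} = \begin{pmatrix} 0 & p_j & 0 \\ v_j & 0 & p_j \\ 0 & v_j & 0 \end{pmatrix}$$
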
In a similar fashion we can model counting semaphores of higher order. 
The matrix representing the r interleaved semaphores is given by

$$S = \bigoplus_{i=1}^{r} S^{(i)}, \quad\text{where } S^{(i)} \in \mathcal{S}.$$

The adjacency matrix representing program $\mathcal{P}$, referred to as $P$, is defined as

$$P = T \circ S = \sum_{l \in \mathcal{L}_S} (T_l \otimes S_l) + \sum_{l \in \mathcal{L}_V} (T_l \oplus S_l). \qquad (4)$$

When applying the Kronecker product to semaphore calls we follow the rules
$v_x \cdot v_x = v_x$ and $p_x \cdot p_x = p_x$.

In Subsect. 3.5 we describe how the ∘-operation can be implemented
efficiently.

3.2 ∘-Operation and Synchronization

Lemma 3. Let $T = \bigoplus_{i=1}^{k} T^{(i)}$ be the matrix representing k interleaved threads
and let S be a binary semaphore. Then $T \circ S$ correctly models synchronization
of T with semaphore S.^6

Proof. First we observe that

1. the first term in the definition of Eq. (4) replaces

   - each p in matrix T with $\begin{pmatrix} 0 & p \\ 0 & 0 \end{pmatrix}$ and

   - each v in matrix T with $\begin{pmatrix} 0 & 0 \\ v & 0 \end{pmatrix}$,

2. the second term replaces each $m \in \mathcal{L}_V$ with $\begin{pmatrix} m & 0 \\ 0 & m \end{pmatrix}$, and

3. both terms replace each 0 by $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$.

According to the replacements above the order of matrix $T \circ S$ has doubled
compared to T.

Now, consider the paths in the automaton underlying T described by the
regular expression

$$\pi = \Big(\sum_{m \in \mathcal{L}_V} m\Big)^{*} \Big( p \Big(\sum_{m \in \mathcal{L}_V} m\Big)^{*} v \Big(\sum_{m \in \mathcal{L}_V} m\Big)^{*} \Big)^{*}.$$

By the observations above it is easy to see that paths containing $\pi$ are present
in $T \circ S$. On the other hand, paths not containing $\pi$ are no longer present in $T \circ S$.
Thus the semaphore operations always occur in (p, v) pairs in all paths in $T \circ S$.
This, however, exactly mirrors the semantics of synchronization via a semaphore.

□



^6 Note that we do not make assumptions concerning the structure of T.

Generalizing Lemma 3, it is easy to see that the synchronization property is
also correctly modeled if we replace the binary semaphore by one which allows
more than one thread to enter it. In addition, the synchronization property is
correctly modeled even if more than one semaphore is present on the right-hand
side of $T \circ S$.

As a byproduct the proof of Lemma 3 shows the following corollary. 

Corollary 1. If the program modeled by $T \circ S$ contains a deadlock, then the
matrix $T \circ S$ will contain a zero line $\ell$. Node $\ell$ in the corresponding automaton
is no final node and does not have successors.

Thus deadlocks show up in CPGs as a pure structural property of the underlying 
graphs. Nevertheless, false positives may occur. From a static point of view, a 
deadlock is possible while conditions exclude this case at runtime. Our approach 
delivers a path to a deadlock in any case. Nevertheless, our approach of finding 
deadlocks is complete. If it states deadlock freedom, then the program under 
test is certainly deadlock free. 
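Corollary 1 can be illustrated on a classic lock-ordering example. The sketch below is not the paper's implementation: it assumes sparse dictionaries `{(row, col): label}` with 0-based node indices, evaluates Eq. (4) entry by entry (a semaphore label in T is paired with equal labels in S; any other label is copied along the diagonal because the filtered semaphore matrix is zero), and uses two hypothetical threads that acquire two binary semaphores in opposite order.

```python
def kron_sum(A, m, B, n):
    """Sparse Kronecker sum; operands are assumed to have no self-loops."""
    R = {}
    for (i, j), label in A.items():
        for r in range(n):
            R[(i * n + r, j * n + r)] = label
    for (p, q), label in B.items():
        for r in range(m):
            key = (r * n + p, r * n + q)
            assert key not in R, "overlapping entries would need semiring +"
            R[key] = label
    return R

def circ(T, S, o_s, sem_labels):
    """Entry-wise evaluation of T o S following Eq. (4)."""
    P = {}
    for (i, j), label in T.items():
        if label in sem_labels:              # term T_l (x) S_l, with l * l = l
            for (p, q), s_label in S.items():
                if s_label == label:
                    P[(i * o_s + p, j * o_s + q)] = label
        else:                                # term T_l (+) S_l with S_l = 0
            for r in range(o_s):
                P[(i * o_s + r, j * o_s + r)] = label
    return P

def reachable(P, entry=0):
    """Return the nodes reachable from entry and the successor map of P."""
    succ = {}
    for (i, j) in P:
        succ.setdefault(i, []).append(j)
    seen, stack = {entry}, [entry]
    while stack:
        for j in succ.get(stack.pop(), []):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen, succ

# Thread 1: p1 p2 v2 v1, thread 2: p2 p1 v1 v2 (hypothetical example).
T1 = {(0, 1): "p1", (1, 2): "p2", (2, 3): "v2", (3, 4): "v1"}
T2 = {(0, 1): "p2", (1, 2): "p1", (2, 3): "v1", (3, 4): "v2"}
S1 = {(0, 1): "p1", (1, 0): "v1"}            # binary semaphore 1
S2 = {(0, 1): "p2", (1, 0): "v2"}            # binary semaphore 2

T = kron_sum(T1, 5, T2, 5)                   # order 25
S = kron_sum(S1, 2, S2, 2)                   # order 4
P = circ(T, S, 4, {"p1", "v1", "p2", "v2"})
order = 25 * 4
final = 24 * 4                               # both threads done, semaphores free

seen, succ = reachable(P)
deadlocks = sorted(n for n in seen if n not in succ and n != final)
```

With this numbering the only reachable zero line other than the final node is node 27, i.e. the state in which each thread holds one semaphore and waits for the other.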

A further consequence of Lemma 3 is that after applying the ∘-operation only
a small part of the underlying automaton can be reached from its entry node. This
allows for optimizations discussed later.



3.3 Unreachable Parts Caused by Synchronization 

In this subsection we show that synchronization causes unreachable parts. As
an example consider Fig. 5. The program consists of two threads, namely $T_1$
and $T_2$. The RCFGs of the threads are shown in Fig. 5(a) and Fig. 5(b). The
used semaphore is a binary semaphore similar to Fig. 1(a). Its operations are
referred to as $p_1$ and $v_1$. We denote a P- and V-call to semaphore x of thread
t as $t.p_x$ and $t.v_x$, respectively. $T_1$ and $T_2$ access the same shared variable in a
and b, respectively. The semaphore is used to ensure that a and b are accessed
mutually exclusively. Note that a and b may actually be subgraphs consisting of
multiple nodes and edges.

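For the example we have the matrices

$$T_1 = \begin{pmatrix} 0 & p_1 & 0 & 0 \\ 0 & 0 & a & 0 \\ 0 & 0 & 0 & v_1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
T_2 = \begin{pmatrix} 0 & p_1 & 0 & 0 \\ 0 & 0 & b & 0 \\ 0 & 0 & 0 & v_1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad\text{and}\quad
S = \begin{pmatrix} 0 & p_1 \\ v_1 & 0 \end{pmatrix}.$$

Then we obtain the matrix $T = T_1 \oplus T_2$, a matrix of order 16, consisting of the
submatrices defined above and zero matrices of order four (instead of $Z_4$ simply
denoted by 0) as follows.

$$T = \begin{pmatrix} T_2 & p_1 \cdot I_4 & 0 & 0 \\ 0 & T_2 & a \cdot I_4 & 0 \\ 0 & 0 & T_2 & v_1 \cdot I_4 \\ 0 & 0 & 0 & T_2 \end{pmatrix}$$
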
In order to enable a concise presentation of $T \circ S$ we define the matrices

$$U = \begin{pmatrix}
0&0&0&p_1&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&b&0&0&0\\
0&0&0&0&0&b&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&v_1&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0
\end{pmatrix}, \quad
V = \begin{pmatrix}
0&p_1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&p_1&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&p_1&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&p_1\\
0&0&0&0&0&0&0&0
\end{pmatrix},$$

$$W = a \cdot I_8, \quad\text{and}\quad
X = \begin{pmatrix}
0&0&0&0&0&0&0&0\\
v_1&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&v_1&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&v_1&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&v_1&0
\end{pmatrix}$$

of order 8.

Then we obtain the matrix $T \circ S$, a matrix of order 32, consisting of the
submatrices defined above and zero matrices of order eight (instead of $Z_8$ simply
denoted by 0) as follows.

$$T \circ S = \begin{pmatrix} U & V & 0 & 0 \\ 0 & U & W & 0 \\ 0 & 0 & U & X \\ 0 & 0 & 0 & U \end{pmatrix}$$

The generated CPG is depicted in Fig. 5(c). The resulting adjacency matrix
has order 32, whereas the resulting CPG consists only of 12 nodes and 12 edges.
Large parts (20 nodes and 20 edges) are unreachable from the entry node. In
Fig. 6 these unreachable parts are depicted.

In general, unreachable parts exist if a concurrent program contains synchro- 
nization. If a program contains a lot of synchronization the reachable parts may 
be very small. This observation motivates the lazy implementation described in 
Subsect. 3.6. 

3.4 Properties of the Resulting Adjacency Matrix 

In this subsection we prove interesting properties of the resulting matrices.

A short calculation shows that the Kronecker sum in general generates at
most $mn^2 + nm^2 - nm$ non-zero entries.^7 Stated the other way, at least
$(mn)^2 - mn^2 - nm^2 + mn$ entries are zero. We will see that CFGs and RCFGs contain
even more zero entries. We will prove that for this case the number of edges
is in $O(mn)$. Thus, the number of edges is linear in the order of the resulting
adjacency matrix.

^7 Assuming the corresponding matrices have an order of m and n, respectively.
Lemma 4 (Maximum Number of Nodes). Given a program $\mathcal{P}$ consisting
of $k > 0$ threads $(t_1, t_2, \ldots, t_k)$, where each $t_i$ has n nodes in its RCFG, the
number of nodes in $\mathcal{P}$'s adjacency matrix P is bounded from above by $n^k$.

Proof. This follows immediately from the definitions of $\otimes$ and $\oplus$. For both the
order of the resulting matrix is given by the multiplication of the orders of the
input matrices. □

Definition 3. Let $M = (m_{i,j}) \in \mathcal{M}$. We denote the number of non-zero entries
by $\|M\| = |\{m_{i,j} \mid m_{i,j} \neq 0\}|$.

For an RCFG with n nodes it is easy to see that it contains at most 2n edges.

Lemma 5 (Maximum Number of Entries). Let a program represented by
$M_k \in \mathcal{M}$ consisting of $k > 0$ threads be represented by the matrices $T^{(i)} \in \mathcal{T}$,
where each $T^{(i)}$ has order n. Then $\|M_k\|$ is bounded from above by $2k\,n^k$.

Proof. We prove this lemma by induction on the definition of the Kronecker sum.
For $k = 1$ the lemma is true. If we assume that for m threads $\|M_m\| \leq 2m\,n^m$,
then for $m + 1$ threads $\|M_{m+1}\| \leq 2m\,n^m \cdot n + n^m \cdot 2n = 2(m + 1)\,n^{m+1}$. Thus,
we have proved Lemma 5. □

Compared to the full matrix of order $n^k$ with $n^{2k}$ entries the resulting matrix
has significantly fewer non-zero entries, namely $2k\,n^k$. By using the following
definition we will prove that the matrices are sparse.

Definition 4 (Sparse Matrix). We call an n-by-n matrix M sparse if and only
if $\|M\| = O(n)$.

Lemma 6. CFGs and RCFGs have Sparse Adjacency Matrices.

Proof. Follows from Subsect. 2.2 and Def. 4. □

Lemma 7. The Matrix P of a Program $\mathcal{P}$ is Sparse.

Proof. Let $T = \bigoplus_{i=1}^{k} T^{(i)} \in \mathcal{M}$ be an N-by-N adjacency matrix of a program. We
require that each of the k threads has order n in its adjacency matrix $T^{(i)}$. From
Lemma 5 we know $\|T\| = O(2k\,n^k)$. In addition, $N = n^k$ is given by Lemma 4.
Hence, for k threads and by using Definition 4 we get $\|T\| \leq 2k\,n^k = 2k\,N = O(N)$.
A similar result holds for S and $P = T \circ S$. □

Lemma 7 enables the application of memory saving data structures and
efficient algorithms. Algorithms may for example work on adjacency lists. Clearly,
the space requirements for the adjacency lists are linear in the number of nodes.
In the worst-case the number of nodes increases exponentially in the number of
threads.
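The bound of Lemma 5 can be observed on a small instance. The sketch below assumes sparse dictionaries `{(row, col): label}` and a hypothetical thread shape in which every node has exactly two outgoing edges and no self-loops (so the bound is met exactly); the helper names are illustrative only.

```python
def kron_sum(A, m, B, n):
    """Sparse Kronecker sum; operands are assumed to have no self-loops."""
    R = {}
    for (i, j), label in A.items():
        for r in range(n):
            R[(i * n + r, j * n + r)] = label
    for (p, q), label in B.items():
        for r in range(m):
            key = (r * n + p, r * n + q)
            assert key not in R, "overlapping entries would need semiring +"
            R[key] = label
    return R

def thread(n, name):
    """Hypothetical RCFG with n >= 3 nodes and two outgoing edges per node."""
    return {(i, (i + d) % n): f"{name}_{i}_{d}"
            for i in range(n) for d in (1, 2)}

n, k = 3, 3
M, order = thread(n, "t1"), n
for t in range(2, k + 1):                    # k-fold Kronecker sum
    M = kron_sum(M, order, thread(n, f"t{t}"), n)
    order *= n

nnz = len(M)                                 # ||M_k||
bound = 2 * k * n ** k                       # Lemma 5
full = order * order                         # entries of the full matrix
```

For $n = k = 3$ the matrix has order 27 and 162 non-zero entries, compared with 729 entries of the full matrix.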
3.5 Efficient Implementation of the ∘-Operation

This subsection is devoted to an efficient implementation of the ∘-operation.
First we define the Selective Kronecker product which we denote by $\oslash$. This
operator synchronizes only identical labels $l \in \mathcal{L}_S$ of the two input matrices.

Definition 5 (Selective Kronecker product). Given two matrices A and
B we call $A \oslash_L B$ their Selective Kronecker product. For all $l \in L \subseteq \mathcal{L}$ let
$A \oslash_L B = (a_{i,j}) \oslash_L (b_{p,q}) = (c_{i.p,\,j.q})$, where

$$c_{i.p,\,j.q} = \begin{cases} l & \text{if } a_{i,j} = b_{p,q} = l \wedge l \in L, \\ 0 & \text{otherwise.} \end{cases}$$

Definition 6 (Filtered Matrix). We call $M_L$ a Filtered Matrix and define
it as a matrix of order $o(M)$ containing entries $l \in L \subseteq \mathcal{L}$ of M and zeros
elsewhere as follows.

$$M_L = (m_{L;i,j}), \quad\text{where } m_{L;i,j} = \begin{cases} l & \text{if } m_{i,j} = l \wedge l \in L, \\ 0 & \text{otherwise.} \end{cases}$$

Note that

$$\sum_{l \in \mathcal{L}_S} (T_l \otimes S_l) = T \oslash_{\mathcal{L}_S} S. \qquad (5)$$

In the following we use $o(S_{\mathcal{L}_V}) = \prod_{i=1}^{r} o(S^{(i)}) = o(S)$. Note that S
contains only labels $l \in \mathcal{L}_S$. Hence, when the ∘-operator is applied for a label
$l \in \mathcal{L}_V$, we get $S_l = Z_{o(S)}$, i.e. a zero matrix of order $o(S)$. Thus we obtain
$\sum_{l \in \mathcal{L}_V} (T_l \oplus S_l) = T_{\mathcal{L}_V} \otimes I_{o(S)}$. We will prove this below.

Finally, we can refine Eq. (4) by stating the following lemma.

Lemma 8. The ∘-operation can be computed efficiently by

$$P = T \circ S = T \oslash_{\mathcal{L}_S} S + T_{\mathcal{L}_V} \otimes I_{o(S)}.$$

Proof. Using Eq. (4), $P = T \circ S$ is given by
$\sum_{l \in \mathcal{L}_S} (T_l \otimes S_l) + \sum_{l \in \mathcal{L}_V} (T_l \oplus S_l)$.
According to Eq. (5) the first term is equal to $T \oslash_{\mathcal{L}_S} S$. By mentioning $S_l = Z_{o(S)}$
for $l \in \mathcal{L}_V$, Lemma 1, and Def. 2, the second term fulfills

$$\sum_{l \in \mathcal{L}_V} (T_l \oplus S_l) = \sum_{l \in \mathcal{L}_V} (T_l \oplus Z_{o(S)}) = T_{\mathcal{L}_V} \oplus Z_{o(S)} = T_{\mathcal{L}_V} \otimes I_{o(S)}.$$

Note that S contains only $l \in \mathcal{L}_S$. It is obvious that the non-zero entries of
the first and the second term are $l \in \mathcal{L}_S$ and $l \in \mathcal{L}_V$, respectively. Both terms
can be computed by iterating once through the corresponding sparse adjacency
matrices, namely T and S. □
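The equivalence stated in Lemma 8 can be checked on the example of Subsect. 3.3. The sketch below is an illustration under assumptions: sparse dictionaries `{(row, col): label}` with 0-based indices, no self-loops in the threads, and hypothetical function names. `eq4` follows Eq. (4) label by label using filtered matrices, whereas `lemma8` makes one pass over S (to index it by label) and one pass over T.

```python
def kron_sum(A, m, B, n):
    """Sparse Kronecker sum; operands are assumed to have no self-loops."""
    R = {}
    for (i, j), label in A.items():
        for r in range(n):
            R[(i * n + r, j * n + r)] = label
    for (p, q), label in B.items():
        for r in range(m):
            R[(r * n + p, r * n + q)] = label
    return R

def filtered(M, labels):
    """Filtered matrix M_L (Def. 6): keep only entries with a label in L."""
    return {key: l for key, l in M.items() if l in labels}

def eq4(T, S, o_s, sem_labels, var_labels):
    """T o S computed term by term as written in Eq. (4)."""
    P = {}
    for l in sem_labels:                     # sum of T_l (x) S_l
        for (i, j) in filtered(T, {l}):
            for (p, q) in filtered(S, {l}):
                P[(i * o_s + p, j * o_s + q)] = l        # l * l = l
    for l in var_labels:                     # sum of T_l (+) S_l, S_l = 0
        for (i, j) in filtered(T, {l}):
            for r in range(o_s):
                P[(i * o_s + r, j * o_s + r)] = l
    return P

def lemma8(T, S, o_s, sem_labels, var_labels):
    """T o S as selective Kronecker product plus T_{L_V} (x) I_{o(S)}."""
    by_label = {}
    for (p, q), l in S.items():              # one pass over S
        by_label.setdefault(l, []).append((p, q))
    P = {}
    for (i, j), l in T.items():              # one pass over T
        if l in sem_labels:
            for (p, q) in by_label.get(l, []):
                P[(i * o_s + p, j * o_s + q)] = l
        elif l in var_labels:
            for r in range(o_s):
                P[(i * o_s + r, j * o_s + r)] = l
    return P

T1 = {(0, 1): "p1", (1, 2): "a", (2, 3): "v1"}
T2 = {(0, 1): "p1", (1, 2): "b", (2, 3): "v1"}
S = {(0, 1): "p1", (1, 0): "v1"}
T = kron_sum(T1, 4, T2, 4)                   # order 16

P_def = eq4(T, S, 2, {"p1", "v1"}, {"a", "b"})
P_fast = lemma8(T, S, 2, {"p1", "v1"}, {"a", "b"})
```

Both variants yield the same matrix of order 32 with 32 non-zero entries; the single-pass variant avoids building one filtered matrix per label.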
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3.6 Lazy Implementation of Kronecker Algebra 

Until now we have primarily focused on a purely mathematical model for shared
memory concurrent systems. An alert reader will have noticed that the order of
the matrices in our CPG increases exponentially in the number of threads. On
the other hand, we have seen that the $\circ$-operation results in parts of the matrix
$T \circ S$ that cannot be reached from the entry node of the underlying automaton
(cf. Subsect. 3.3). This comes solely from the fact that synchronization excludes
some interleavings.

Choosing a lazy implementation for the matrix operations, however, ensures
that, when extracting the reachable parts of the underlying automaton, the
overall effort is reduced to exactly these parts. Our lazy implementation does
exactly this by starting from the entry node and calculating all reachable
successor nodes. Thus, for example, if the resulting automaton's size is linear in
the number of involved threads, only linear effort will be necessary to generate
the resulting automaton.

Our implementation distinguishes between two kinds of matrices: Sparse
matrices are used for representing threads and semaphores. Lazy matrices are
employed for representing all the other matrices, e.g. those resulting from the
operations of the Kronecker algebra and our $\circ$-operation. Besides the employed
operation, a lazy matrix simply keeps track of its operands. Whenever an entry
of a lazy matrix is retrieved, depending on the operation recorded in the
lazy matrix, entries of the operands are retrieved and the recorded operation is
performed on these entries to calculate the result. In the course of this
computation, even the successors of nodes are evaluated lazily. Retrieving entries of
operands is done recursively if the operands are again lazy matrices, or is done
by retrieving the entries from the sparse matrices, where the actual data resides.
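
The following Python sketch illustrates this scheme. It is not the implementation described here: the class names, the `successors` interface, the 0-based node numbering, and the three-node client shape (p, then a, then v back to the start) are assumptions chosen so that the sketch is self-contained.

```python
from functools import reduce

class SparseMatrix:
    """Labelled adjacency matrix stored row-wise; absent entries are zero."""
    def __init__(self, order, entries):
        self.order = order
        self.rows = {}
        for (i, j), label in entries.items():
            self.rows.setdefault(i, []).append((j, label))

    def successors(self, i):
        return iter(self.rows.get(i, ()))

class LazyKronSum:
    """A (+) B = A (x) I + I (x) B; only records its operands."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.order = a.order * b.order

    def successors(self, i):
        ia, ib = divmod(i, self.b.order)
        for ja, label in self.a.successors(ia):   # one step in A
            yield ja * self.b.order + ib, label
        for jb, label in self.b.successors(ib):   # one step in B
            yield ia * self.b.order + jb, label

class LazyCirc:
    """T o S, evaluated per row as in Lemma 8."""
    def __init__(self, t, s, sync_labels):
        self.t, self.s, self.sync = t, s, frozenset(sync_labels)
        self.order = t.order * s.order

    def successors(self, i):
        it, i_s = divmod(i, self.s.order)
        for jt, label in self.t.successors(it):
            if label in self.sync:
                # Synchronizing edge: S must offer the same label.
                for js, s_label in self.s.successors(i_s):
                    if s_label == label:
                        yield jt * self.s.order + js, label
            else:
                yield jt * self.s.order + i_s, label

def reachable(matrix, entry=0):
    """Collect the nodes and edges reachable from the entry node."""
    seen, todo, edges = {entry}, [entry], []
    while todo:
        n = todo.pop()
        for j, label in matrix.successors(n):
            edges.append((n, j, label))
            if j not in seen:
                seen.add(j)
                todo.append(j)
    return seen, edges

def client_server(n_clients):
    """Assumed client shape: 0 -p-> 1 -a-> 2 -v-> 0, one binary semaphore."""
    client = SparseMatrix(3, {(0, 1): "p", (1, 2): "a", (2, 0): "v"})
    semaphore = SparseMatrix(2, {(0, 1): "p", (1, 0): "v"})
    threads = reduce(LazyKronSum, [client] * n_clients)
    return LazyCirc(threads, semaphore, {"p", "v"})

nodes, edges = reachable(client_server(8))
```

Only the rows of reachable nodes are ever expanded, so the traversal touches a handful of nodes even though the order of the full matrix grows exponentially with the number of clients.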

In addition, our lazy implementation allows for straightforward parallelization.
For example, retrieving the entries of left and right operands can be done concurrently.
Exploiting this, we expect further performance improvements for our
implementation if run on multi-core architectures.

3.7 Optimization for NSV 

Our approach already works fine for practical settings. In this subsection we 
present additional optimizations which are optional. 

As already mentioned in Subsect. 2.4 the Kronecker sum interleaves all en- 
tries. Sometimes this is disadvantageous because irrelevant interleavings will be 
generated if some basic blocks do not access shared variables. Such basic blocks 
can be placed freely as long as other constraints do not prohibit it. 

For example, consider the CFGs in Fig. 3. Assume for a moment that $a$, $b$,
$c$, and $d$ do not access shared variables. Then the overall behavior of the
C-D-system can be described correctly by choosing one of the six interleavings
depicted in Fig. 3(d), e.g., by $a \cdot b \cdot c \cdot d$. Hence the size of the CPG is reduced
from nine nodes to five.
Fig. 7. A Counterexample: (a) $E$, (b) $F$, (c) $E \oplus F$

From now on we divide set $\mathcal{L}_V$ into two disjoint sets $\mathcal{L}_{SV}$ and $\mathcal{L}_{NSV}$ depending
on whether the corresponding basic blocks access shared variables or not.
The following example shows that NSV-edges cannot always be eliminated.

Example 4. In this example we use the graphs depicted in Fig. 7. The graphs $E$
and $F$ form the input graphs. It is assumed that $a$ is the only edge not accessing
a shared variable. All graphs have Node 1 as entry node. We show that it is not
sufficient to choose exactly one NSV-edge. The matrix $E \oplus F$ is given by

$$E \oplus F = \begin{pmatrix}
0 & c & a & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & a & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & c & p & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & p & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & c & b & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & b \\
v & 0 & 0 & 0 & 0 & 0 & 0 & c \\
0 & v & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.$$

The graph represented by $E \oplus F$, which is structurally isomorphic to $(E \oplus F) \circ S$,
is depicted in Fig. 7(c). Both loops in the CPG must be preserved. Otherwise the
program would be modeled incorrectly. By removing an edge labeled by $a$, we
would change the program behavior. Thus it is not sufficient to use only one
edge labeled by $a$.
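
The Kronecker sum used in this example can be reproduced with a few lines of Python. In the sketch below the input matrices `E` (a loop over four nodes labeled a, p, b, v) and `F` (a single edge labeled c) are assumptions read off from the printed matrix, and `kron_sum` is a hypothetical helper that assumes neither operand has self-loops.

```python
def kron_sum(A, B):
    """A (+) B = A (x) I_n + I_m (x) B for label matrices (0 means no edge).

    Assumes that A and B have no self-loops, so no two labels ever
    land on the same entry of the result.
    """
    m, n = len(A), len(B)
    C = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(m):
            if A[i][j] != 0:
                for k in range(n):          # A (x) I_n
                    C[i * n + k][j * n + k] = A[i][j]
    for i in range(m):
        for k in range(n):
            for l in range(n):
                if B[k][l] != 0:            # I_m (x) B
                    C[i * n + k][i * n + l] = B[k][l]
    return C

E = [[0, "a", 0, 0],
     [0, 0, "p", 0],
     [0, 0, 0, "b"],
     ["v", 0, 0, 0]]
F = [[0, "c"],
     [0, 0]]
EF = kron_sum(E, F)
a_edges = [(i, j) for i, row in enumerate(EF) for j, x in enumerate(row) if x == "a"]
```

With 0-based indices, `a_edges` contains the two edges labeled a, one in each copy of the loop of $E$, which is why neither of them can be dropped.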
In general, the only way to reduce the size of the resulting CPG is by studying
the matrix $T \circ S$. One way would be to output the automaton from $T \circ S$ and
try to find reductions afterwards. We decided to perform such reductions during
the output process such that an unnecessarily large automaton is not generated.
It turned out that the problems to be solved to perform these reductions are
hard. This will be discussed in detail below.

Fig. 8 shows the algorithm employed for the output process in pseudo code.
By $\ell(n \to i)$ we denote the label assigned to edge $(n \to i)$. In short, the algorithm
records all NSV-edges and proceeds until no other edges can be processed. Then
it chooses one label of the NSV-edges. From the set of all recorded edges with this
label a subset is determined such that all the edges in the subset can be reached
from all nodes that have been processed so far. This is a necessary condition if
we want to eliminate the edges outside the subset. Determining a minimal subset
under this constraint, however, is known as the Set Covering Problem, which is
NP-hard. We decided to implement a greedy algorithm. However, it turned out
that in most cases we encountered a subset of size one, which trivially is optimal.

If no subset can be found, no edges can be eliminated. 
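
A greedy selection of this kind can be sketched as follows. This is the standard greedy heuristic for set covering, not necessarily the variant used in our implementation; the function name and the input format (for each candidate edge, the set of DONE nodes from which it can be reached) are assumptions of the sketch.

```python
def greedy_subset(reached_from, done):
    """Greedy set cover sketch (hypothetical helper).

    reached_from maps each candidate NSV-edge to the set of nodes in
    DONE from which that edge can be reached. Returns a list of edges
    such that every node in DONE reaches at least one of them, or None
    if no such subset exists.
    """
    uncovered = set(done)
    chosen = []
    while uncovered:
        # Pick the edge that covers the most still-uncovered nodes.
        best = max(reached_from,
                   key=lambda e: len(reached_from[e] & uncovered),
                   default=None)
        if best is None or not (reached_from[best] & uncovered):
            return None  # some node in DONE reaches no candidate edge
        chosen.append(best)
        uncovered -= reached_from[best]
    return chosen

# Edges labeled c in the example of Fig. 3 once nodes 1, 4 and 7 are done.
candidates = {(1, 2): {1}, (4, 5): {1, 4}, (7, 8): {1, 4, 7}}
subset = greedy_subset(candidates, {1, 4, 7})
```

For these inputs the heuristic returns the single edge (7 → 8), i.e., a subset of size one.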

Concerning Ex. 4 we note that the reason why none of the NSV-edges can be
eliminated lies in the presence of the loop in $E$. Our output algorithm
traverses the CPG in such a way that we do not know in advance if a loop will be
constructed later on. Hence our algorithm has to be aware of loops that will be
constructed in the future. This is done by remembering eliminated edges which
will be reconsidered if a suitable loop is encountered.

In detail, if edges can be eliminated, we remember the set of eliminated edges
$\mathcal{R}$ in set RECONSIDER together with a copy of the current set DONE. If later on
we encounter a path in the CPG that reaches some nodes in this set DONE, we
have to reconsider our decision. In this case all edges in $\mathcal{R}$ are reconsidered for
being present in the CPG. Note that several RECONSIDER-sets can be affected
if such a "back edge" is found. Note also that this reconsider mechanism handles
Ex. 4 correctly.

Our implementation showed that the decision which label is chosen in Line 29 
is also crucial. The number of edges (and nodes) being eliminated heavily de- 
pends on this choice. We are currently working on heuristics for this choice. 

In the following we execute the algorithm on the example of Fig. 3 under
the above conditions, i.e., $a$, $b$, $c$, and $d$ do not access shared variables. At the
beginning we have TBD = {1} and TBDNSV(a) = TBDNSV(b) = TBDNSV(c) =
TBDNSV(d) = DONE = ∅. Since RECONSIDER-sets are not necessary in this
example, we do not consider them in the following to keep things simple.

The 1st iteration finds NSV-edges only. So: TBDNSV(a) = {(1 → 4)},
TBDNSV(c) = {(1 → 2)}, DONE = {1} and the other sets are empty.

The 2nd iteration chooses label a in Line 29. SUBSET clearly is {(1 → 4)},
TBD = {4}, and TBDNSV(a) = ∅.

The 3rd iteration processes Node 4 and again finds NSV-edges only. So:
TBDNSV(c) = {(1 → 2), (4 → 5)} and TBDNSV(b) = {(4 → 7)}. DONE
becomes {1, 4}.
OutputCPG

 1  TBD ← {startnode}
 2  TBDNSV(ℓ ∈ L_NSV)   {array of sets; all sets initialized to ∅}
 3  DONE ← ∅
 4  while TBD ≠ ∅ or ∃ℓ : TBDNSV(ℓ) ≠ ∅ do
 5    if TBD ≠ ∅ then
 6      n ← Element(TBD)   {choose one element of set TBD}
 7      print n
 8      for all edges (n → i) do
 9        if ℓ(n → i) ∈ L_NSV then
10          TBDNSV(ℓ(n → i)) ← TBDNSV(ℓ(n → i)) ∪
11              {(n → i)}
12        else
13          TBD ← TBD ∪ {i}
14          print (n → i)
15        endif
16        while ∃D : i ∈ D and ∃R : (D, R) ∈ RECONSIDER do
17          {we have found a path back to a set of nodes
18           which we have used to eliminate NSV edges;
19           all these edges have now to be reconsidered}
20          for (m → j) ∈ R do
21            TBDNSV(ℓ(m → j)) ← TBDNSV(ℓ(m → j)) ∪
22                {(m → j)}
23          endfor
24          RECONSIDER ← RECONSIDER \ {(D, R)}
25        endwhile
26      endfor
27      DONE ← DONE ∪ {n}; TBD ← TBD \ DONE
28    else {TBD = ∅}
29      ℓ ← NonEmptyElement(TBDNSV)
30      {choose one label with non-empty set in TBDNSV}
31      SUBSET ← SmallestSubset(TBDNSV(ℓ), DONE)
32      {choose smallest subset of TBDNSV(ℓ) such that
33       subset can be reached from all nodes in set DONE}
34      if TBDNSV(ℓ) \ SUBSET ≠ ∅ then
35        RECONSIDER ← RECONSIDER ∪
36            {(DONE, TBDNSV(ℓ) \ SUBSET)}
37        {remember eliminated edges;
38         in case we find a path back to nodes in DONE,
39         we have to reconsider these edges}
40      endif
41      for (n → i) ∈ SUBSET do
42        print (n → i)
43        TBD ← TBD ∪ {i}
44      endfor
45      TBDNSV(ℓ) ← ∅
46    endif
47  endwhile

Fig. 8. Output CPG
The 4th iteration chooses label b in Line 29. Thus SUBSET clearly is
{(4 → 7)}, TBD = {7}, and TBDNSV(b) = ∅.

The 5th iteration processes Node 7 and finds one NSV-edge labeled c. So:
TBDNSV(c) = {(1 → 2), (4 → 5), (7 → 8)}. DONE becomes {1, 4, 7}.

The 6th iteration handles label c. The smallest subset is found to be
{(7 → 8)} since Node 7 can be reached from each of the nodes in set DONE = {1, 4, 7}.
Hence, edges (1 → 2) and (4 → 5) can be eliminated, i.e., they are not printed.
So: TBDNSV(c) = ∅ and TBD = {8}.

The 7th iteration finds one NSV-edge labeled d. Thus we continue with
TBDNSV(d) = {(8 → 9)}. DONE becomes {1, 4, 7, 8}.

The 8th iteration handles label d. We obtain TBDNSV(d) = ∅ and TBD =
{9}.

The 9th iteration prints Node 9, sets DONE = {1, 4, 7, 8, 9} and TBD = ∅.
The algorithm terminates and the result is depicted in Fig. 9.




Fig. 9. Sequentialized C-D-System 



4 Client-Server Example 

We have analyzed client-server scenarios using our lazy implementation.
For the example presented here we have used clients and a semaphore of the
form shown in Fig. 10(a) and 10(b), respectively.

In Table 10(c) statistics for 1, 2, 4, 8, 16, and 32 clients are given. Fig. 10(d)
shows the resulting graph for 8 clients. The few nodes in the resulting matrix and
the node IDs indicate that most nodes in the resulting matrix are superfluous.
The case of 32 clients and one semaphore forms a matrix with an order of approx.
$3.706 \times 10^{15}$. Our implementation generated only 65 nodes in 0.43s. In fact we
observed a linear growth in the number of clients for the number of nodes and
edges and for the execution time. We did our analysis on an Intel Xeon 2.8 GHz
with 8GB DDR2 RAM. Note that an implementation of the matrix calculus for
shared memory concurrent systems has to provide node IDs of a sufficient size.
The order of $T \circ S$ can be quite big, although the resulting automaton is small.
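
The gap between the two sizes can be made explicit with a short calculation. It assumes, as the first row of Table 10(c) suggests, that each client contributes 3 nodes and the binary semaphore 2 nodes, and it reads the closed form $2n + 1$ for the generated nodes off the table rather than from a proof:

$$o(T \circ S) = 2 \cdot 3^{n}, \qquad n = 32: \quad 2 \cdot 3^{32} = 3\,706\,040\,377\,703\,682 \approx 3.706 \times 10^{15},$$

whereas the generated automaton has only $2n + 1 = 65$ nodes for $n = 32$. A node ID for this case therefore needs at least 52 bits, since $2^{51} < 2 \cdot 3^{32} < 2^{52}$.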

5 Generic Proof of Deadlock Freedom 

Let $S_i$ for $i \ge 1$ denote binary semaphores and let their operations be denoted
by $p_i$ and $v_i$.
(a) Client    (b) Semaphore    (d) Result for 8 Clients

(c) Statistics

Clients   Nodes   Edges   Exec. Time [s]   Potential Nodes
      1       3       3           0.0013                 6
      2       5       6           0.0013                18
      4       9      12           0.0045               162
      8      17      24           0.0120            13,122
     16      33      48           0.0680        86,093,442
     32      65      96           0.4300   3.706 × 10^15

Fig. 10. Client-Server Example

Definition 7. Let $M = (m_{i,j}) \in \mathcal{M}$ denote a square matrix. In addition, let
$\mathcal{P}_M = \{(i, j, r) \mid m_{i,j} = p_r \text{ for some } r \ge 1\}$ and $\mathcal{V}_M = \{(j, i, r) \mid m_{i,j} =
v_r \text{ for some } r \ge 1\}$ (note the exchanged indexes $(j, i)$). We call $M$ p-v-symmetric
iff $\mathcal{P}_M = \mathcal{V}_M$.
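
Definition 7 translates directly into a small check. In the Python sketch below the encoding of entries (0 for no edge, a pair such as `("p", 1)` for $p_1$, any other value for an ordinary label) and the function name are assumptions made for illustration.

```python
def is_pv_symmetric(M):
    """Check Definition 7 for a square matrix given as a list of rows."""
    P, V = set(), set()
    for i, row in enumerate(M):
        for j, entry in enumerate(row):
            if isinstance(entry, tuple):
                kind, r = entry
                if kind == "p":
                    P.add((i, j, r))
                elif kind == "v":
                    V.add((j, i, r))   # note the exchanged indexes
    return P == V

p1, v1, p2, v2 = ("p", 1), ("v", 1), ("p", 2), ("v", 2)

binary_semaphore = [[0, p1],
                    [v1, 0]]
star = [[0, p1, p2],       # a thread that acquires and releases either semaphore
        [v1, 0, 0],
        [v2, 0, 0]]
client = [[0, p1, 0],      # p, then an ordinary edge a, then v
          [0, 0, "a"],
          [v1, 0, 0]]
```

Here `binary_semaphore` and `star` satisfy the definition, while `client` does not, because its $v_1$ entry is not the mirror image of its $p_1$ entry.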

By definition of Kronecker sum and Kronecker product, it is easy to prove the 
following lemma. 

Lemma 9. Let $M$ and $N$ be p-v-symmetric matrices. Then $M \otimes N$, $M \oplus N$,
and $M \circ N$ are also p-v-symmetric. □

To be more specific, let $S_i = \begin{pmatrix} 0 & p_i \\ v_i & 0 \end{pmatrix}$ for $i \ge 1$. Then $S^{(k)} = \bigoplus_{i=1}^{k} S_i$ is
p-v-symmetric.
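Now, consider the p-v-symmetric matrix

$$M_k = \begin{pmatrix}
0 & p_1 & p_2 & \cdots & p_k \\
v_1 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_k & 0 & 0 & \cdots & 0
\end{pmatrix}.$$

Thus $M^{(n)} = \bigoplus_{j=1}^{n} M_k$ is also p-v-symmetric.
Now we state a theorem on deadlock freedom.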
Now we state a theorem on deadlock freedom. 

Theorem 1. Let $P = M^{(n)} \circ S^{(k)}$ be the matrix of an $n$-threaded program with
$k$ binary semaphores, where $M^{(n)}$ and $S^{(k)}$ are defined above. Then the program
is deadlock free.

Proof. By definition and Lemma 9, $P$ is p-v-symmetric. By Corollary 1 a deadlock
manifests itself by a zero line, say $\ell$, in matrix $P$. Since $P$ is p-v-symmetric,
column $\ell$ contains only zeroes. Hence line $\ell$ is unreachable in the underlying
automaton.

This clearly holds for all zero lines in P and thus the program is deadlock 
free. □ 

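For counting semaphores we obtain matrices of the following type

$$\begin{pmatrix}
0 & p & 0 & \cdots & 0 & 0 \\
v & 0 & p & \cdots & 0 & 0 \\
0 & v & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 0 & p \\
0 & 0 & 0 & \cdots & v & 0
\end{pmatrix}$$

which clearly is p-v-symmetric. Thus a similar theorem holds if counting
semaphores are used instead of binary ones.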

A short reflection shows that if we allow $M_k$ to contain additional entries and
non-zero lines and columns which do not contain $p$s and $v$s, the system is still
deadlock free. So, we have derived a very powerful criterion to ensure deadlock
freedom for a large class of programs, namely p-v-symmetric programs.

Concerning the example in Section 4 we note that if edges labeled $a$ are
removed from the clients, we obtain p-v-symmetric matrices. Thus this simple
client-server system is deadlock free for an arbitrary number of clients. If we
reinsert edges labeled $a$ into the clients, no zero lines and columns appear (as
noted above), so that the system is still deadlock free for an arbitrary number
of clients.

Theorem 1 may be compared to the results of [8,5], where for homogeneous
token passing rings it is proved that checking correctness properties can be
reduced to rings of small sizes.
T1()
1  s.p          {edge T1.p}
2  r ← sv       {edge a}
3  r ← r + 1    {edge b}
4  sv ← r       {edge b}
5  s.v          {edge T1.v}

T2()
1  t ← sv       {edge c}
2  s.p          {edge T2.p}
3  t ← t + 1    {edge d}
4  sv ← t       {edge d}
5  s.v          {edge T2.v}

Fig. 11. Example Program

(a) T1    (b) T2

Fig. 12. RCFGs after Edge Splitting

6 A Data Race Example 



We give an example where a programmer is supposed to have used synchronization
primitives in a wrong way. The program, consisting of two threads, namely
T1 and T2, and a semaphore s, is given in Fig. 11. We assume that sv = 0 at
program start. It is supposed that the program delivers sv = 2 when it terminates.
Both threads in the program access the shared variable sv. The variables
r and t are local to the corresponding threads. The programmer inadvertently
has placed line 1 in front of line 2 in T2.

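After edge splitting we get the RCFGs depicted in Fig. 12. As usual the
semaphore looks like Fig. 1(a). The corresponding matrices are

$$T_1 = \begin{pmatrix}
0 & T_1.p & 0 & 0 & 0 \\
0 & 0 & a & 0 & 0 \\
0 & 0 & 0 & b & 0 \\
0 & 0 & 0 & 0 & T_1.v \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
\quad \text{and} \quad
T_2 = \begin{pmatrix}
0 & c & 0 & 0 & 0 \\
0 & 0 & T_2.p & 0 & 0 \\
0 & 0 & 0 & d & 0 \\
0 & 0 & 0 & 0 & T_2.v \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.$$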
Although the following matrices are not computed by our lazy implementation,
we give them here to allow the reader to see a complete example. To enable
a concise presentation we define the following submatrices of order five:

$$H = \begin{pmatrix}
0 & c & 0 & 0 & 0 \\
0 & 0 & T_2.p & 0 & 0 \\
0 & 0 & 0 & d & 0 \\
0 & 0 & 0 & 0 & T_2.v \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}, \quad
I = \operatorname{diag}(T_1.p, T_1.p, T_1.p, T_1.p, T_1.p),$$

$$J = a \cdot I_5, \quad K = b \cdot I_5, \quad \text{and} \quad
L = \operatorname{diag}(T_1.v, T_1.v, T_1.v, T_1.v, T_1.v).$$

Now, we get $T = T_1 \oplus T_2$, a matrix of order 25, consisting of the submatrices
defined above and zero matrices of order five (instead of $Z_5$ simply denoted by
0).

$$T = \begin{pmatrix}
H & I & 0 & 0 & 0 \\
0 & H & J & 0 & 0 \\
0 & 0 & H & K & 0 \\
0 & 0 & 0 & H & L \\
0 & 0 & 0 & 0 & H
\end{pmatrix}$$

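To shorten the presentation of $P = T \circ S$ we define the following submatrices of
order ten:

$$U = \begin{pmatrix}
0 & 0 & c & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & c & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & T_2.p & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & d & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & d & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & T_2.v & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},$$

$$V = I_5 \otimes \begin{pmatrix} 0 & T_1.p \\ 0 & 0 \end{pmatrix}, \quad
W = a \cdot I_{10}, \quad X = b \cdot I_{10}, \quad \text{and} \quad
Y = I_5 \otimes \begin{pmatrix} 0 & 0 \\ T_1.v & 0 \end{pmatrix}.$$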



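With the help of zero matrices of order ten we can state the program's matrix

$$P = T \circ S = T \circ \begin{pmatrix} 0 & p \\ v & 0 \end{pmatrix} = \begin{pmatrix}
U & V & 0 & 0 & 0 \\
0 & U & W & 0 & 0 \\
0 & 0 & U & X & 0 \\
0 & 0 & 0 & U & Y \\
0 & 0 & 0 & 0 & U
\end{pmatrix}.$$

Fig. 13. Resulting CPG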

Matrix $P$ has order 50. The corresponding CPG is shown in Fig. 13. The
lazy implementation computes only these 19 nodes. Due to synchronization the
other parts are not reachable. In addition to the usual labels we have added a set
of tuples to each edge in the CPG of Fig. 13. Tuple $(x, y, z)$ denotes values of
variables, such that sv = x, r = y and t = z. We use $\bot$ to refer to an undefined
value. A triple shows the values after the basic block on the corresponding edge
has been evaluated. The entry node of the CPG is Node 1. At program start we
have the variable assignment $(0, \bot, \bot)$. At Node 49 we end up with the set of tuples
{(1, 1, 1), (2, 1, 2), (2, 2, 1)}. Due to the interleavings different tuples may occur
at join nodes. We reflect this by a set of tuples. As stated above the program is
supposed to deliver sv = 2. Thus the tuple (1, 1, 1) shows that the program is
erroneous. The error is caused by a data race between the edge c of thread T2
and the edges a and b of thread T1.
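
The set of final tuples can be cross-checked independently of the matrix calculus by enumerating all interleavings of the two threads. The Python sketch below does this at statement granularity, following the listing in Fig. 11; the state encoding and the function names are assumptions of the sketch, and `None` stands for the undefined value.

```python
def t1_step(pc, state):
    """Execute statement pc (0-based) of T1; return None if blocked."""
    sv, r, t, sem = state
    if pc == 0:                                   # s.p
        return (sv, r, t, sem - 1) if sem > 0 else None
    if pc == 1:                                   # r <- sv
        return (sv, sv, t, sem)
    if pc == 2:                                   # r <- r + 1
        return (sv, r + 1, t, sem)
    if pc == 3:                                   # sv <- r
        return (r, r, t, sem)
    return (sv, r, t, sem + 1)                    # s.v

def t2_step(pc, state):
    """Execute statement pc (0-based) of T2; return None if blocked."""
    sv, r, t, sem = state
    if pc == 0:                                   # t <- sv (outside the critical section)
        return (sv, r, sv, sem)
    if pc == 1:                                   # s.p
        return (sv, r, t, sem - 1) if sem > 0 else None
    if pc == 2:                                   # t <- t + 1
        return (sv, r, t + 1, sem)
    if pc == 3:                                   # sv <- t
        return (t, r, t, sem)
    return (sv, r, t, sem + 1)                    # s.v

def final_tuples():
    """Explore all interleavings; return the set of final (sv, r, t)."""
    start = (0, 0, (0, None, None, 1))            # pc1, pc2, (sv, r, t, semaphore)
    seen, todo, finals = {start}, [start], set()
    while todo:
        pc1, pc2, state = todo.pop()
        if pc1 == 5 and pc2 == 5:
            finals.add(state[:3])
            continue
        successors = []
        if pc1 < 5:
            successors.append((pc1 + 1, pc2, t1_step(pc1, state)))
        if pc2 < 5:
            successors.append((pc1, pc2 + 1, t2_step(pc2, state)))
        for nxt in successors:
            if nxt[2] is not None and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return finals

result = final_tuples()
```

The enumeration yields the same three tuples as the CPG, including the erroneous (1, 1, 1).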



7 Empirical Data 

In Sec. 4 we already gave some empirical data concerning client-server examples. 
In this section we give empirical data for ten additional examples. 

Let $o(P)$ and $o(C)$ refer to the order of the adjacency matrix $P$, which is not
computed by our lazy implementation, and the order of the adjacency matrix $C$
of the resulting CPG, respectively. In addition $k$ and $r$ refer to the number of
threads and the number of semaphores, respectively.



k    r    o(P)         sqrt(o(P))   o(C)    Runtime [s]
2    4    256          16.00        12      0.03
3    5    4800         69.28        30      0.097542
4    6    124416       352.73       98      0.48655
3    6    75264        274.34       221     1.057529
4    7    614400       783.84       338     2.537082
4    8    1536000      1239.35      277     2.566587
4    8    737280       858.65       380     3.724364
4    13   298721280    17283.56     2583    96.024073
4    11   55050240     7419.58      3908    146.81
5    6    14929920     3863.93      7666    309.371395

Table 1. Empirical Data
In the following we use the data depicted in Table 1.* The numbers in the
fourth column are rounded to two decimal places. As a first observation we note
that except for one example all values of $o(C)$ are smaller than the corresponding
values of $\sqrt{o(P)}$. In addition, the runtime of our implementation shows a
strong correlation to the order $o(C)$ of the adjacency matrix $C$ of the generated
CPG with a Pearson product-moment correlation coefficient of 0.9990277130. In
contrast the theoretical order $o(P)$ of the resulting adjacency matrix
$P$ correlates to the runtime only with a correlation coefficient of 0.2370050995.†

These observations show that the runtime complexity does not depend on the
order $o(P)$, which grows exponentially in the number of threads. We conclude
this section by stating that the collected data give strong indication that the
runtime complexity of our approach is linear in the number of nodes present in
the resulting CPG.
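
The two correlation coefficients can be recomputed from the values in Table 1. The following Python sketch uses the textbook formula for the Pearson product-moment correlation coefficient; the helper name and variable names are chosen for this sketch only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Columns o(P), o(C) and Runtime [s] of Table 1, row by row.
order_p = [256, 4800, 124416, 75264, 614400, 1536000, 737280,
           298721280, 55050240, 14929920]
order_c = [12, 30, 98, 221, 338, 277, 380, 2583, 3908, 7666]
runtime = [0.03, 0.097542, 0.48655, 1.057529, 2.537082, 2.566587,
           3.724364, 96.024073, 146.81, 309.371395]

r_c = pearson(order_c, runtime)   # runtime against o(C)
r_p = pearson(order_p, runtime)   # runtime against o(P)
exceptions = [c for c, p in zip(order_c, order_p) if c >= sqrt(p)]
```

The list `exceptions` contains the single example whose $o(C)$ is not smaller than $\sqrt{o(P)}$.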

8 Related Work 

Probably the closest work to ours was done by Buchholz and Kemper [3]. It 
differs from our work as stated in the following. We establish a framework for 
analyzing multithreaded shared memory concurrent systems which forms a ba- 
sis for studying various properties of the program. Different techniques including 
dataflow analysis (e.g. [23-25,14]) and model checking (e.g. [6,9] to name only 
a few) can be applied to the generated CPGs. In this paper we use our approach 
in order to prove deadlock freedom. Buchholz and Kemper worked on gener- 
ating reachability sets in composed automata. Our approach uses CFGs and 
semaphores to model shared memory concurrent programs. Buchholz and Kem- 
per use it for describing networks of synchronized automata. Both approaches 
employ Kronecker algebra. An additional difference is that we propose optimiza- 
tions concerning the handling of edges not accessing shared variables and lazy 
evaluation of the matrix entries. 

In [9] Ganai and Gupta studied modeling concurrent systems for bounded
model checking (BMC). Somewhat similar to our approach, the concurrent system
is modeled lazily. In contrast, our approach does not need temporal logic
specifications like LTL for proving deadlock freedom for p-v-symmetric programs,
but on the other hand our approach may suffer from false positives. Like all
BMC approaches, [9] has the drawback that it can only show correctness within
a bounded number of k steps.

Kahlon et al. propose a framework for static analysis of concurrent pro- 
grams in [13]. Partial order reduction and synchronization constraints are used 
to reduce thread interleavings. In order to gain further reductions abstract in- 
terpretation is applied. 

In [22] a model checking tool is presented that builds up a system gradually,
at each stage compressing the subsystems to find an equivalent CSP process
with far fewer states. With this approach systems of exponential size ($> 10^{20}$)
can be model checked successfully. This can be compared to our client-server
example in Sect. 4, where matrices of exponential size can be handled in linear
time.

* We did our analysis on an Intel Pentium D 3.0 GHz machine with 1GB DDR RAM
running CentOS 5.6.
† Both correlation coefficients are rounded to ten decimal places.

Although not closely related, we recognize the work done in the field of
stochastic automata networks (SAN), which is based on the work of Plateau [19],
and in the field of generalized stochastic Petri nets (GSPN) (e.g. [4]) as related
work. Compared to ours these fields are completely different. Nevertheless, basic
operators are shared and some properties influenced this paper.

9 Conclusion 

We established a framework for analyzing multithreaded shared memory concur- 
rent systems which forms a basis for studying various properties of programs. 
Different techniques including dataflow analysis and model checking can be ap- 
plied to CPGs. In addition, the structure of the matrices can be used to prove 
properties of the underlying program for an arbitrary number of threads. In this 
paper we used CPGs in order to prove deadlock freedom for the large class of 
p-v-symmetric programs. 

Furthermore, we proved that in general CPGs can be represented by sparse 
matrices. Hence the number of entries in the matrices is linear in their number 
of lines. Thus efficient algorithms can be applied to CPGs. 

We proposed two major optimizations. First, if the program contains a lot 
of synchronization, only a very small part of the CPG is reachable and, due 
to a lazy implementation of the matrix operations, only this part is computed. 
Second, if the program has only little synchronization, many edges not accessing 
shared variables will be present, which are reduced during the output process of 
the CPG. Both optimizations speed up processing significantly and show that 
this approach is very promising. 

We gave examples for both, the lazy implementation and how we are able to 
prove deadlock freedom. 

The first results of our approach (such as Theorem 1) and the performance of
our prototype implementation are very promising. Further research is needed to
generalize Theorem 1 in order to handle systems similar to the Dining Philosophers
problem. In addition, details on how to perform (complete and sound)
dataflow analysis on CPGs have to be studied.